Table of Contents

Loans Data Exploration

Preliminary Wrangling

This document explores a dataset containing metadata related to loans such as:

Initialize

Install packages

Import libraries

Back to top

Load dataset

Google Colab

Local folder

Copy Raw Dataframe

Back to top

GLOBAL

Functions

Back to top

Data Structure

There are 113,937 entries, across 81 variables (as seen above) in this loans dataset. Below are the variables of interest during this exploration.

Duplicates

Back to top

Data Cleaning

Data dictionary variables of interest Summary:

Before continuing any further only columns of interest will be retained for the exploration to ensure only the required variables are modified.

Datatypes

Date time correction
Date Year and Month extraction
Combine pre and post 2009 ratings

There are ratings we were recorded pre 2009 and post 2009.

CreditGrade and ProsperRating, these two are pairs because they cannot be analysed separately due to each rating type being based on a specific time period, where the former was used for users pre 2009 and the latter was for post 2009. Quickly reviewing the two columns reveals there are Nan evident on either field whilst the other contains an entry. As such the dataframe will contain a new column combining the two, specifically the post 2009 overwritting the Nan values.

A dictionary will be created to assign and apply the relevant numerical values to the

Having Not Available and Income Displayed is redundant information that doesn't assist in the analysis of what affects BorrowRate.
With the Employment Status analysed, the next breakdown from Employment are the Occupations.

Category re-definition

Income range requires cleaning as evident with \$0 and Not employed values.

Column type definitions

Back to top

Univariate Exploration

NaN/Missing Data

First visual is to grasp the amount of NaN present in the current dataset to determine whether additional cleaning is required.

Back to top

Exploratory data analysis

The EDA will revolve around the exploration of the 14 variables mentioned previously.

Back to top

Numerical variables

Of the 500 random samples:

Of the quantitative variables above, we will investigate Prosper Score/Credit Grade, Borrowrate, MonthlyLoanPayment to clarify the correlations seen above against categorical variables that have not yet been considered. The heatmap below provides numeric evidence of the visual inspection of the pair grid.

Back to top

What affects a loan repayments interest rate/BorrowRate?

What factors greatly affect the BorrowRate and as a result the MonthlyLoanPayment of the customer?

Borrower Rate

Both graphs approximate the median BorrowRate to be 18%, both plots suggest outliers to be considered ~38% and above. This allows unless to optimize the bins to be between 0 and 0.4.

BorrowRate variable shows a multi-modal normal distribution, one towards the lower end where the bulk of the loaners are within 10% to 20%. The second towards the higher end with a peak at 31%.

The log scale transformation reveals the outliers not seen in the prior 2 graphs above it.

With the distribution of primary variable explored. The remaining categorical variables that were not able to be analysed in the pair plot will be investigated to determine possibly influences with BorrowerRate.

Back to top

Employment Status

The univariate exploration of EmploymentStatus reveals majority of the loan applicants are 'Employed', which is redundant as it splits further into Full-time Employed, Self-Employed, Part-time, other etc meaning the data classifications were not considered appropriately. The Bivariate exploration against BorrowRate, reveals the spread of employment with generally revolving around 10 to 30% +/- 5%, with a large peak at ~32%. The log transformation provides clarity for the lower values not otherwise seen in graph 2, the borrow rate of full-time employees receiving ~0.4% is interesting. If they were incorrectly entered or due to the low overall amount required to be borrowed that led to such a low rate.

It is evident that BorrowerRate is affected by EmploymentStatus given the following: -Of the 7 employment types, the median borrow rate of Not employed is significantly higher then the remaining, as well a large majority of it's spread situated at the higher borrow rates due to the increased risk of the individual paying it back. -Self-employed & Full-time employees having the lowest medians -Employed & Other are generalised as discussed previously.

It is interesting to see Other has the widest spread at ~32%, with Employed having the second highest width. The remaining employment types seem to taper off at the same amount.

Occupation

The above graph shows several types of Occupations, split into there appropriate position name. It is important to note the top two occupations are ambigious and uninformative i.e. Professionals and Other.

As there several Occupations to display, it will be quite cumbersome to display all the statistical distributions of each, example seen below.

Using the box plot above, we can see a better capture of the Occupations relationship to BorrowerRate. It is particularly evident that Occupations do impact borrow rates as they are a subset of EmploymentStatus evident with Judge and Doctors appearing to have the lowest median borrow rates, where as Student College Freshman and Teachers aid appear to have the highest.

Due to the quantity of the Occupations list, a select few professions will be assessed to ensure a wide proportion of the population is captured. This selection is visual based with attempts to collect a low and high borrow rate of each field i.e. business, law, engineering, public service, admin, etc.

Initially the two plots were on separate subplots which made comparing them side by side difficult. As such combining them onto the same plot allows for a better visual of the width of the violins, as well as the outliers, lower and upper quartiles shown by the box plot. It is apparent Occupations do impact the borrow rates issued out by Prosper, next variable to investigate is Income Range which directly ties to the amount each occupation is paid.

Prior to dropping Other and Professional occupations, an investigation into the spread of data will be performed to determine how it sits when compared against Borrower Rate.

Before dropping Other and Professional occupations from the dataframe, we observe the plot above showing a normal distribution and its potential influence to BorrowRate. The peak at ~32% would appear to the usual BorrowRate provided by Prosper.

Back to top

IncomeRange

The next category of interest is the Incomes of the respective Occupations above as well its overall spread.

From the plots above we can see a left skewed distribution, with the majority of incomes ranging almost equally between \$25,000-49,999 and \$50,000-75,000 .

The plot of BorrowerRate against IncomeRange reveals a pattern reinforcing the idea that Income affects the BorrowerRate provided by Prosper, explicitly that the higher your income guarantees you will have a lower borrow rate on your loan.

From the multi-variates, it is obvious that the higher income ranges, generally results in lower BorrowerRates. The only exception to these are Doctors, Clerical, Investors and Student - College Graduate Students. It is interesting to see how there are some loans approved despite being listed for having No income otherwise considered as \$0 as shown above.

Monthly Loan Payment & Term

From the above histogram plots we are able to see a heavily right tailed distribution, where the logarithmic scaling reveals the outliers ~\$1500 per month and above, the bulk of the customer repayments ranging around \$150 per month.

The table summary, violin & box plot above reveals the majority of Prosper clients requiring loans prefer 36 month/3 year terms. The plot reveals a significant quantity of outliers within 12 month/1 year terms, afterwards the outliers gradually decrease as the terms increase. The median monthly repayments range between as low as $100 to as high as $400 across all Occupations.

Delving deeper into the spread within each of the sample occupations mentioned earlier, we can see that Doctors have a large IQR (interquartile range) indicating a large population of doctors have loan repayments of substantial variance when taking out a 12 month/1 year loan, potentially indicating the professions potential buying power, spending habits/capital investments required as part of there profession, i.e. dental equipment amongst other various factors speculating if these Doctors start their own practice and need the large capital investment.

The graph below is the previously plotted graph as seen on the pair plot at the start of section 2. EDA, however this graph includes the term variable to reveal the distribution of repayments amounts based on the terms chosen. The plot below further supports the statement above where the people prefer the 36 month/3 year term, it reveals the spread of individuals with there respective loan repayments vs the borrower rate they have been given by Prosper. The 36 month/3 year term is significantly dominant soon followed by the 60 month/5 year then 12 month/1 year.

Back to top

CreditRating

CreditRating is a combination of ProsperRating & CreditGrade

The above joint plot supports the ordinal data hierarchy evident with an AA rating receiving the lowest BorrowerRates and HR i.e. high risk ratings receiving the highest. As highlighted during the plotting of the numerical variables, there is a high correlation between ProsperRating (numeric) i.e. a numerical rating of CreditGrade and ProsperRating (Alpha).

The bivariate kde plot of MonthlyLoanPayments is difficult to view and is plotted below, showing similar distributions for A, B & C ratings. All the graphs apart from the HR graph are shown to have right tailed distributions supporting the previous findings of clients having lower monthly repayments. The only rating that is not aligned with the others is the HR rating.

In the above occupations we can see the Prosper Ratings ranging between 3 and 5, which corresponds to a B, C and D rating. Doctors have the same IQR across the various employment status, Employed Judges has best box plot across across all occupations and employment types.

Back to top

BorrowerState

As borrower states are numerous only the top 6 states will be analysed.

Of the top 9 states requiring loans, CA appears to have requested the most loans with VA being the lowest of the 9 states.

Back to top

Debt to Income Ratio

Analysis of numerical variable against categoricals

From the logarithmic transformation above, we can see several plots past the ratio of 1, indicating there debts are greater then there available income. It would interesting to see what the ProsperRatingNum as well as the Borrow Rate are for these individuals including there Income Range.

From the above jointplot, for the clients with a debt to income ratio greater then 1, we can see there are a mix of all types of rating Credit ratings. We can confirm that the variable has no effect on Borrower Rate as a result.

Back to top

ListingCreationDate & ClosedDate

From the above histogram plots faceted based on years, we can see that 2013 was the year with the most loans closed which makes sense for the following reasons:

Back to top